

Latent Dirichlet Allocation

#artificialintelligence

Latent Dirichlet Allocation, or LDA for short, is an unsupervised machine learning algorithm. Similar to the clustering algorithm K-means, LDA attempts to group words and documents into a predefined number of clusters (i.e., topics). These topics can then be used to organize and search through documents. One of the most popular methods for estimating an LDA model is Gibbs sampling. Let's walk through one iteration of the algorithm.
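One iteration of collapsed Gibbs sampling can be sketched as follows: for each word occurrence, remove its current topic assignment from the count matrices, sample a new topic from the conditional distribution, and add the new assignment back. The corpus, vocabulary size, and hyperparameter values below are illustrative, not from any particular dataset.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy corpus: each document is a list of word ids from a small vocabulary.
docs = [[0, 1, 2, 1], [2, 3, 3, 0], [1, 1, 4, 2]]
V = 5                    # vocabulary size
K = 2                    # number of topics (fixed in advance, like k in K-means)
alpha, beta = 0.1, 0.01  # Dirichlet hyperparameters (illustrative values)

# Random initial topic assignment for every word occurrence.
z = [[rng.integers(K) for _ in doc] for doc in docs]

# Count matrices maintained by the sampler.
ndk = np.zeros((len(docs), K))  # topic counts per document
nkw = np.zeros((K, V))          # word counts per topic
nk = np.zeros(K)                # total words assigned to each topic
for d, doc in enumerate(docs):
    for i, w in enumerate(doc):
        k = z[d][i]
        ndk[d, k] += 1; nkw[k, w] += 1; nk[k] += 1

# One full Gibbs sweep: resample each word's topic from its conditional
# distribution given all the other current assignments.
for d, doc in enumerate(docs):
    for i, w in enumerate(doc):
        k = z[d][i]
        ndk[d, k] -= 1; nkw[k, w] -= 1; nk[k] -= 1  # remove current assignment
        p = (ndk[d] + alpha) * (nkw[:, w] + beta) / (nk + V * beta)
        k = rng.choice(K, p=p / p.sum())            # sample a new topic
        z[d][i] = k
        ndk[d, k] += 1; nkw[k, w] += 1; nk[k] += 1  # record new assignment
```

Repeating this sweep many times lets the counts converge toward the posterior; topic-word and document-topic distributions are then read off `nkw` and `ndk`.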


AI supported Topic Modeling using KNIME-Workflows

Qundus, Jamal Al, Peikert, Silvio, Paschke, Adrian

arXiv.org Artificial Intelligence

Topic modeling algorithms traditionally model topics as lists of weighted terms. These topic models can be used effectively to classify texts or to support text-mining tasks such as text summarization or fact extraction. The general procedure relies on statistical analysis of term frequencies. The focus of this work is the implementation of knowledge-based topic modeling services in a KNIME workflow. A brief description and evaluation of the DBpedia-based enrichment approach, and a comparative evaluation of the enriched topic models, are outlined based on our previous work. DBpedia Spotlight is used to identify entities in the input text, and information from DBpedia is used to extend these entities. We provide a workflow developed in KNIME implementing this approach and compare topic modeling supported by knowledge-base information against traditional LDA. This topic modeling approach allows semantic interpretation both by algorithms and by humans.


LDA for Text Summarization and Topic Detection - DZone AI

#artificialintelligence

Machine learning clustering techniques are not the only way to extract topics from a text data set. The text mining literature has proposed a number of statistical models, known as probabilistic topic models, to detect topics in an unlabeled set of documents. One of the most popular is the latent Dirichlet allocation (LDA) algorithm developed by Blei, Ng, and Jordan [i]. LDA is a generative unsupervised probabilistic algorithm that isolates the top K topics in a data set, each described by its most relevant N keywords. In other words, the documents in the data set are represented as random mixtures of latent topics, where each topic is characterized by a distribution over a fixed vocabulary, with Dirichlet priors on both the topic mixtures and the topic-word distributions.
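The generative story described above can be sketched directly: draw each topic's word distribution from a Dirichlet, then for each document draw a topic mixture and sample words. All sizes and hyperparameter values below are illustrative choices, not from the cited paper.

```python
import numpy as np

rng = np.random.default_rng(42)

K, V, doc_len = 3, 8, 10
alpha = np.full(K, 0.5)   # Dirichlet prior over a document's topic mixture
eta = np.full(V, 0.1)     # Dirichlet prior over each topic's word distribution

# Each of the K topics is a distribution over the fixed vocabulary of V words.
topics = rng.dirichlet(eta, size=K)        # shape (K, V)

# Generate one document: draw its random topic mixture, then each word.
theta = rng.dirichlet(alpha)               # random mixture of latent topics
zs = rng.choice(K, size=doc_len, p=theta)  # latent topic for each word position
words = [rng.choice(V, p=topics[z]) for z in zs]
```

Inference (e.g., Gibbs sampling or variational Bayes) runs this story in reverse: given only `words` across many documents, it recovers estimates of `topics` and each document's `theta`.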


Fast Online Incremental Learning on Mixture Streaming Data

Wang, Yi (Dalian University of Technology) | Fan, Xin (Dalian University of Technology) | Luo, Zhongxuan (Dalian University of Technology) | Wang, Tianzhu (No. 254, Deta Leisure Town, Jinzhou New District, Dalian) | Min, Maomao (Dalian University of Technology) | Luo, Jiebo (University of Rochester)

AAAI Conferences

The explosion of streaming data poses challenges to feature learning methods, including linear discriminant analysis (LDA). Many existing LDA algorithms are not efficient enough to update incrementally with samples that arrive sequentially in various manners. First, we propose a new fast batch LDA learning algorithm (FLDA/QR) that uses the cluster centers to solve a lower triangular system, optimized by Cholesky factorization. To take advantage of the intrinsically incremental mechanism of the matrix factorization, we further develop an exact incremental algorithm (IFLDA/QR). The Gram-Schmidt process with reorthogonalization in IFLDA/QR significantly reduces space and time costs compared with the rank-one QR updating of most existing methods. IFLDA/QR can handle streaming data containing 1) new labeled samples in existing classes, 2) samples of an entirely new (novel) class, and, more significantly, 3) a chunk of examples mixing 1) and 2). Both theoretical analysis and numerical experiments demonstrate much lower space and time costs (2-10 times faster) than the state of the art, with comparable classification accuracy.
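The paper's FLDA/QR and IFLDA/QR algorithms are not reproduced here, but the underlying idea of solving for the discriminant through triangular systems from a Cholesky factorization, rather than an explicit matrix inverse, can be illustrated with a minimal two-class Fisher discriminant on made-up data:

```python
import numpy as np

rng = np.random.default_rng(1)

# Two toy Gaussian classes in 2-D.
X0 = rng.normal([0, 0], 0.3, size=(50, 2))
X1 = rng.normal([2, 1], 0.3, size=(50, 2))

m0, m1 = X0.mean(axis=0), X1.mean(axis=0)  # class centers
# Within-class scatter matrix (sum of per-class scatter).
Sw = np.cov(X0, rowvar=False) * 49 + np.cov(X1, rowvar=False) * 49

# Solve Sw w = (m1 - m0) via Cholesky: two triangular solves instead of
# inverting Sw (np.linalg.solve is used here for simplicity; a dedicated
# triangular solver would exploit the structure of L).
L = np.linalg.cholesky(Sw)
y = np.linalg.solve(L, m1 - m0)  # forward solve with lower triangular L
w = np.linalg.solve(L.T, y)      # back solve with upper triangular L.T
w /= np.linalg.norm(w)

# Project onto w and classify against the midpoint of the projected means.
threshold = ((X0 @ w).mean() + (X1 @ w).mean()) / 2
acc = (np.mean(X0 @ w < threshold) + np.mean(X1 @ w > threshold)) / 2
```

The incremental algorithms in the paper go further by updating such factorizations in place as new samples or new classes stream in, avoiding recomputation from scratch.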


Towards Big Topic Modeling

Yan, Jian-Feng, Zeng, Jia, Liu, Zhi-Qiang, Gao, Yang

arXiv.org Machine Learning

To solve the big topic modeling problem, we need to reduce both the time and space complexities of batch latent Dirichlet allocation (LDA) algorithms. Although parallel LDA algorithms on multi-processor architectures have low time and space complexities, their communication costs among processors often scale linearly with the vocabulary size and the number of topics, leading to a serious scalability problem. To reduce the communication complexity among processors for better scalability, we propose a novel communication-efficient parallel topic modeling architecture based on power laws, which consumes orders of magnitude less communication time when the number of topics is large. We combine the proposed communication-efficient parallel architecture with the online belief propagation (OBP) algorithm, referred to as POBP, for big topic modeling tasks. Extensive empirical results confirm that POBP has the following advantages for big topic modeling: 1) high accuracy, 2) communication efficiency, 3) fast speed, and 4) constant memory usage, when compared with recent state-of-the-art parallel LDA algorithms on multi-processor architectures.